{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Engaging with Multimodal Models: GPT-4V in AG2\n",
    "\n",
    "In AG2, leveraging multimodal models can be done through two different methodologies:\n",
    "1. **MultimodalAgent**: Supported by GPT-4V and other LMMs, this agent is endowed with visual cognitive abilities, allowing it to engage in interactions comparable to those of other ConversableAgents.\n",
    "2. **VisionCapability**: For LLM-based agents lacking inherent visual comprehension, we introduce vision capabilities by converting images into descriptive captions.\n",
    "\n",
    "This guide will delve into each approach, providing insights into their application and integration."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "### Before everything starts, install AG2 with the `lmm` option\n",
    "\n",
    "Install `ag2`:\n",
    "```bash\n",
    "pip install \"ag2[lmm]>=0.2.17\"\n",
    "```\n",
    "\n",
    "For more information, please refer to the [installation guide](https://docs.ag2.ai/latest/docs/quick-start).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from PIL import Image\n",
    "\n",
    "import autogen\n",
    "from autogen import Agent, AssistantAgent, ConversableAgent, LLMConfig, UserProxyAgent\n",
    "from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability\n",
    "from autogen.agentchat.contrib.img_utils import pil_to_data_uri\n",
    "from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent\n",
    "from autogen.code_utils import content_str"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "<a id=\"app-1\"></a>\n",
    "## Application 1: Image Chat\n",
    "\n",
    "In this section, we present a straightforward dual-agent architecture to enable user to chat with a multimodal agent.\n",
    "\n",
    "\n",
    "First, we show this image and ask a question.\n",
    "![](https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "Within the user proxy agent, we can decide to activate the human input mode or not (for here, we use human_input_mode=\"NEVER\" for conciseness). This allows you to interact with LMM in a multi-round dialogue, enabling you to provide feedback as the conversation unfolds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm_config_4v = LLMConfig.from_json(path=\"OAI_CONFIG_LIST\", temperature=0.5, max_tokens=300).where(\n",
    "    model=[\"gpt-4-vision-preview\"]\n",
    ")\n",
    "\n",
    "\n",
    "gpt4_llm_config = LLMConfig.from_json(path=\"OAI_CONFIG_LIST\", cache_seed=42).where(\n",
    "    model=[\"gpt-4\", \"gpt-4-0314\", \"gpt4\", \"gpt-4-32k\", \"gpt-4-32k-0314\", \"gpt-4-32k-v0314\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "Learn more about configuring LLMs for agents [here](https://docs.ag2.ai/latest/docs/user-guide/basic-concepts/llm-configuration)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {},
   "outputs": [],
   "source": [
    "image_agent = MultimodalConversableAgent(\n",
    "    name=\"image-explainer\",\n",
    "    max_consecutive_auto_reply=10,\n",
    "    llm_config=llm_config_4v,\n",
    ")\n",
    "\n",
    "user_proxy = autogen.UserProxyAgent(\n",
    "    name=\"User_proxy\",\n",
    "    system_message=\"A human admin.\",\n",
    "    human_input_mode=\"NEVER\",  # Try between ALWAYS or NEVER\n",
    "    max_consecutive_auto_reply=0,\n",
    "    code_execution_config={\n",
    "        \"use_docker\": False\n",
    "    },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.\n",
    ")\n",
    "\n",
    "# Ask the question with an image\n",
    "user_proxy.initiate_chat(\n",
    "    image_agent,\n",
    "    message=\"\"\"What's the breed of this dog?\n",
    "<img https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0>.\"\"\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "Now, input another image, and ask a followup question.\n",
    "\n",
    "![](https://th.bing.com/th/id/OIP.29Mi2kJmcHHyQVGe_0NG7QHaEo?pid=ImgDet&rs=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ask the question with an image\n",
    "user_proxy.send(\n",
    "    message=\"\"\"What is this breed?\n",
    "<img https://th.bing.com/th/id/OIP.29Mi2kJmcHHyQVGe_0NG7QHaEo?pid=ImgDet&rs=1>\n",
    "\n",
    "Among the breeds, which one barks less?\"\"\",\n",
    "    recipient=image_agent,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10",
   "metadata": {},
   "source": [
    "<a id=\"app-2\"></a>\n",
    "## Application 2: Figure Creator\n",
    "\n",
    "Here, we define a `FigureCreator` agent, which contains three child agents: commander, coder, and critics.\n",
    "\n",
    "- Commander: interacts with users, runs code, and coordinates the flow between the coder and critics.\n",
    "- Coder: writes code for visualization.\n",
    "- Critics: LMM-based agent that provides comments and feedback on the generated image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {},
   "outputs": [],
   "source": [
    "working_dir = \"tmp/\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "class FigureCreator(ConversableAgent):\n",
    "    def __init__(self, n_iters=2, **kwargs):\n",
    "        \"\"\"Initializes a FigureCreator instance.\n",
    "\n",
    "        This agent facilitates the creation of visualizations through a collaborative effort among its child agents: commander, coder, and critics.\n",
    "\n",
    "        Parameters:\n",
    "            - n_iters (int, optional): The number of \"improvement\" iterations to run. Defaults to 2.\n",
    "            - **kwargs: keyword arguments for the parent AssistantAgent.\n",
    "        \"\"\"\n",
    "        super().__init__(**kwargs)\n",
    "        self.register_reply([Agent, None], reply_func=FigureCreator._reply_user, position=0)\n",
    "        self._n_iters = n_iters\n",
    "\n",
    "    def _reply_user(self, messages=None, sender=None, config=None):\n",
    "        if all((messages is None, sender is None)):\n",
    "            error_msg = f\"Either {messages=} or {sender=} must be provided.\"\n",
    "            logger.error(error_msg)  # noqa: F821\n",
    "            raise AssertionError(error_msg)\n",
    "        if messages is None:\n",
    "            messages = self._oai_messages[sender]\n",
    "\n",
    "        user_question = messages[-1][\"content\"]\n",
    "\n",
    "        # Define the agents\n",
    "        commander = AssistantAgent(\n",
    "            name=\"Commander\",\n",
    "            human_input_mode=\"NEVER\",\n",
    "            max_consecutive_auto_reply=10,\n",
    "            system_message=\"Help me run the code, and tell other agents it is in the <img result.jpg> file location.\",\n",
    "            is_termination_msg=lambda x: x.get(\"content\", \"\").rstrip().endswith(\"TERMINATE\"),\n",
    "            code_execution_config={\"last_n_messages\": 3, \"work_dir\": working_dir, \"use_docker\": False},\n",
    "            llm_config=self.llm_config,\n",
    "        )\n",
    "\n",
    "        critics = MultimodalConversableAgent(\n",
    "            name=\"Critics\",\n",
    "            system_message=\"\"\"Criticize the input figure. How to replot the figure so it will be better? Find bugs and issues for the figure.\n",
    "            Pay attention to the color, format, and presentation. Keep in mind of the reader-friendliness.\n",
    "            If you think the figures is good enough, then simply say NO_ISSUES\"\"\",\n",
    "            llm_config=llm_config_4v,\n",
    "            human_input_mode=\"NEVER\",\n",
    "            max_consecutive_auto_reply=1,\n",
    "            #     use_docker=False,\n",
    "        )\n",
    "\n",
    "        coder = AssistantAgent(\n",
    "            name=\"Coder\",\n",
    "            llm_config=self.llm_config,\n",
    "        )\n",
    "\n",
    "        coder.update_system_message(\n",
    "            coder.system_message\n",
    "            + \"ALWAYS save the figure in `result.jpg` file. Tell other agents it is in the <img result.jpg> file location.\"\n",
    "        )\n",
    "\n",
    "        # Data flow begins\n",
    "        commander.initiate_chat(coder, message=user_question)\n",
    "        img = Image.open(os.path.join(working_dir, \"result.jpg\"))\n",
    "        plt.imshow(img)\n",
    "        plt.axis(\"off\")  # Hide the axes\n",
    "        plt.show()\n",
    "\n",
    "        for i in range(self._n_iters):\n",
    "            commander.send(\n",
    "                message=f\"Improve <img {os.path.join(working_dir, 'result.jpg')}>\",\n",
    "                recipient=critics,\n",
    "                request_reply=True,\n",
    "            )\n",
    "\n",
    "            feedback = commander._oai_messages[critics][-1][\"content\"]\n",
    "            if feedback.find(\"NO_ISSUES\") >= 0:\n",
    "                break\n",
    "            commander.send(\n",
    "                message=\"Here is the feedback to your figure. Please improve! Save the result to `result.jpg`\\n\"\n",
    "                + feedback,\n",
    "                recipient=coder,\n",
    "                request_reply=True,\n",
    "            )\n",
    "            img = Image.open(os.path.join(working_dir, \"result.jpg\"))\n",
    "            plt.imshow(img)\n",
    "            plt.axis(\"off\")  # Hide the axes\n",
    "            plt.show()\n",
    "\n",
    "        return True, os.path.join(working_dir, \"result.jpg\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13",
   "metadata": {},
   "outputs": [],
   "source": [
    "creator = FigureCreator(name=\"Figure Creator~\", llm_config=gpt4_llm_config)\n",
    "\n",
    "user_proxy = autogen.UserProxyAgent(\n",
    "    name=\"User\", human_input_mode=\"NEVER\", max_consecutive_auto_reply=0, code_execution_config={\"use_docker\": False}\n",
    ")\n",
    "\n",
    "user_proxy.initiate_chat(\n",
    "    creator,\n",
    "    message=\"\"\"\n",
    "Plot a figure by using the data from:\n",
    "https://raw.githubusercontent.com/vega/vega/main/docs/data/seattle-weather.csv\n",
    "\n",
    "I want to show both temperature high and low.\n",
    "\"\"\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {},
   "source": [
    "## Vision Capability: Group Chat Example with Multimodal Agent\n",
    "\n",
    "We recommend using VisionCapability for group chat managers so that it can organize and understand images better."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15",
   "metadata": {},
   "outputs": [],
   "source": [
    "agent1 = MultimodalConversableAgent(\n",
    "    name=\"image-explainer-1\",\n",
    "    max_consecutive_auto_reply=10,\n",
    "    llm_config=llm_config_4v,\n",
    "    system_message=\"Your image description is poetic and engaging.\",\n",
    ")\n",
    "agent2 = MultimodalConversableAgent(\n",
    "    name=\"image-explainer-2\",\n",
    "    max_consecutive_auto_reply=10,\n",
    "    llm_config=llm_config_4v,\n",
    "    system_message=\"Your image description is factual and to the point.\",\n",
    ")\n",
    "\n",
    "user_proxy = autogen.UserProxyAgent(\n",
    "    name=\"User_proxy\",\n",
    "    system_message=\"Describe image for me.\",\n",
    "    human_input_mode=\"TERMINATE\",  # Try between ALWAYS, NEVER, and TERMINATE\n",
    "    max_consecutive_auto_reply=10,\n",
    "    code_execution_config={\n",
    "        \"use_docker\": False\n",
    "    },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.\n",
    ")\n",
    "\n",
    "# We set max_round to 5\n",
    "groupchat = autogen.GroupChat(agents=[agent1, agent2, user_proxy], messages=[], max_round=5)\n",
    "\n",
    "vision_capability = VisionCapability(lmm_config=llm_config_4v)\n",
    "group_chat_manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=gpt4_llm_config)\n",
    "vision_capability.add_to_agent(group_chat_manager)\n",
    "\n",
    "rst = user_proxy.initiate_chat(\n",
    "    group_chat_manager,\n",
    "    message=\"\"\"Write a poet for my image:\n",
    "                        <img https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0>.\"\"\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "## Behavior with and without VisionCapability for Agents\n",
    "\n",
    "\n",
    "Here, we show the behavior of an agent with and without VisionCapability. We use the same image and question as in the previous example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {},
   "outputs": [],
   "source": [
    "agent_no_vision = AssistantAgent(name=\"Regular LLM Agent\", llm_config=gpt4_llm_config)\n",
    "\n",
    "agent_with_vision = AssistantAgent(name=\"Regular LLM Agent with Vision Capability\", llm_config=gpt4_llm_config)\n",
    "vision_capability = VisionCapability(lmm_config=llm_config_4v)\n",
    "vision_capability.add_to_agent(agent_with_vision)\n",
    "\n",
    "\n",
    "user = UserProxyAgent(\n",
    "    name=\"User\",\n",
    "    human_input_mode=\"NEVER\",\n",
    "    max_consecutive_auto_reply=0,\n",
    "    code_execution_config={\"use_docker\": False},\n",
    ")\n",
    "\n",
    "message = \"\"\"Write a poet for my image:\n",
    "                        <img https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0>.\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18",
   "metadata": {},
   "outputs": [],
   "source": [
    "user.send(message=message, recipient=agent_no_vision, request_reply=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19",
   "metadata": {},
   "outputs": [],
   "source": [
    "user.send(message=message, recipient=agent_with_vision, request_reply=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20",
   "metadata": {},
   "source": [
    "## Custom Caption Function for Vision Capability\n",
    "\n",
    "In many use cases, we can use a custom function within the Vision Capability to transcribe an image into a caption.\n",
    "\n",
    "For instance, we can use rule-based algorithm or other models to detect the color, box, and other components inside the image.\n",
    "\n",
    "The custom model should take a path to the image and return a string caption.\n",
    "\n",
    "In the example below, the Vision Capability will call LMM to get caption and also call the custom function to get more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21",
   "metadata": {},
   "outputs": [],
   "source": [
    "def my_description(image_url: str, image_data: Image = None, lmm_client: object = None) -> str:\n",
    "    \"\"\"This function takes an image URL and returns the description.\n",
    "\n",
    "    Parameters:\n",
    "        - image_url (str): The URL of the image.\n",
    "        - image_data (PIL.Image): The image data.\n",
    "        - lmm_client (object): The LLM client object.\n",
    "\n",
    "    Returns:\n",
    "        - str: A description of the color of the image.\n",
    "    \"\"\"\n",
    "    # Print the arguments for illustration purpose\n",
    "    print(\"image_url\", image_url)\n",
    "    print(\"image_data\", image_data)\n",
    "    print(\"lmm_client\", lmm_client)\n",
    "\n",
    "    img_uri = pil_to_data_uri(image_data)  # cast data into URI (str) format for API call\n",
    "    lmm_out = lmm_client.create(\n",
    "        context=None,\n",
    "        messages=[\n",
    "            {\n",
    "                \"role\": \"user\",\n",
    "                \"content\": [\n",
    "                    {\"type\": \"text\", \"text\": \"Describe this image in 10 words.\"},\n",
    "                    {\n",
    "                        \"type\": \"image_url\",\n",
    "                        \"image_url\": {\n",
    "                            \"url\": img_uri,\n",
    "                        },\n",
    "                    },\n",
    "                ],\n",
    "            }\n",
    "        ],\n",
    "    )\n",
    "    description = lmm_out.choices[0].message.content\n",
    "    description = content_str(description)\n",
    "\n",
    "    # Convert the image into an array of pixels.\n",
    "    pixels = np.array(image_data)\n",
    "\n",
    "    # Calculate the average color.\n",
    "    avg_color_per_row = np.mean(pixels, axis=0)\n",
    "    avg_color = np.mean(avg_color_per_row, axis=0)\n",
    "    avg_color = avg_color.astype(int)  # Convert to integer for color values\n",
    "\n",
    "    # Format the average color as a string description.\n",
    "    caption = f\"\"\"The image is from {image_url}\n",
    "    It is about: {description}\n",
    "    The average color of the image is RGB:\n",
    "        ({avg_color[0]}, {avg_color[1]}, {avg_color[2]})\"\"\"\n",
    "\n",
    "    print(caption)  # For illustration purpose\n",
    "\n",
    "    return caption"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22",
   "metadata": {},
   "outputs": [],
   "source": [
    "agent_with_vision_and_func = AssistantAgent(\n",
    "    name=\"Regular LLM Agent with Custom Func and LMM\", llm_config=gpt4_llm_config\n",
    ")\n",
    "\n",
    "vision_capability_with_func = VisionCapability(\n",
    "    lmm_config=llm_config_4v,\n",
    "    custom_caption_func=my_description,\n",
    ")\n",
    "vision_capability_with_func.add_to_agent(agent_with_vision_and_func)\n",
    "\n",
    "user.send(message=message, recipient=agent_with_vision_and_func, request_reply=True)"
   ]
  }
 ],
 "metadata": {
  "front_matter": {
   "description": "Leveraging multimodal models through two different methodologies: MultimodalConversableAgent and VisionCapability.",
   "tags": [
    "multimodal",
    "gpt-4v"
   ]
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
